Provenance for Data Mining

Authors

  • Boris Glavic
  • Javed Siddique
  • Periklis Andritsos
  • Renée J. Miller
Abstract

Data mining aims to extract useful information from large datasets. Most data mining approaches reduce the input data to produce a smaller output that summarizes the mining result. While the purpose of data mining (extracting information) necessitates this reduction in size, the loss of information it entails can be problematic. Specifically, the results of data mining may be more confusing than insightful if the user cannot understand which input data they are based on and how they were created. In this paper, we argue that the user needs access to the provenance of mining results. Provenance, while extensively studied by the database, workflow, and distributed systems communities, has not yet been considered for data mining. We analyze the differences between database, workflow, and data mining provenance, suggest new types of provenance, and identify new use cases for provenance in data mining. To illustrate our ideas, we present a more detailed discussion of these concepts for two typical data mining algorithms: frequent itemset mining and multi-dimensional scaling.

1 Provenance for Data Mining

While some related work from the data mining community has considered techniques for visualizing mining results [8], evaluating their interestingness [13], or detecting causal relationships [20], there are no tools that compute mining provenance, that is, the reasons why and how a certain result was produced. As with provenance for databases or workflows, data mining provenance could be defined in different ways with different use cases in mind. Before discussing the requirements, challenges, and use cases, note that this paper focuses on provenance for data mining, which is unrelated to previous approaches that apply data mining techniques to compute or analyze provenance [7].

Many provenance models define provenance as a subset of the input data that caused an output of interest to appear in the result of a transformation. For example, a standard database provenance model named Why-provenance [6] considers a set of input tuples of a query to be in the provenance of an output tuple if they are sufficient to derive the output through the query. Other provenance models use necessity, preservation of equivalence [12], or causality [6] to model these data dependencies between inputs and outputs. For simplicity, and for lack of a better term, we will refer to all these models as forms of why-provenance.

Why-provenance. The concepts underlying relational why-provenance models (sufficiency, necessity, preservation of equivalence, and causality) are also meaningful for data mining. Retrieving the inputs that influenced a result is especially useful for data mining, because most data mining algorithms generate a small, condensed result from a large input dataset. While this reduction is in line with the purpose of data mining (finding useful information in data), it can be problematic, because data reduction is lossy. Provenance can help to selectively recover this information for an output of interest, thus helping us to better understand the result.

Efficiently generating why-provenance for data mining techniques may not be trivial, because of the large number of inputs that influence a result. Furthermore, unless we can generalize the processing of data mining algorithms in terms of their provenance behaviour, efficient approaches for provenance generation would have to be developed from scratch for each such algorithm. One idea to explore in this context is to model data mining operations as workflows or database queries and use standard fine-grained provenance models from these application domains. However, this approach may not be applicable to all data mining algorithms and could be less efficient than specialized provenance tracking algorithms for data mining. While traditional why-provenance can be adapted for data mining, its usefulness is limited by the fact that
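
To make the sufficiency-based notion of why-provenance more concrete for frequent itemset mining, the following is a minimal sketch and not the authors' method. It assumes that the provenance of a frequent itemset can be taken to be the set of input transactions that contain it. The function name `frequent_itemsets_with_provenance`, the toy transactions, and the brute-force enumeration (used here instead of an efficient algorithm such as Apriori) are all illustrative assumptions.

```python
from itertools import combinations


def frequent_itemsets_with_provenance(transactions, min_support):
    """Map each frequent itemset to the ids of the transactions containing it.

    Assumption (not from the paper): the recorded transaction ids serve as
    the why-provenance of the itemset. Every subset of every transaction is
    enumerated, so this is only suitable for tiny examples.
    """
    supporters = {}
    for tid, items in transactions.items():
        for size in range(1, len(items) + 1):
            for combo in combinations(sorted(items), size):
                supporters.setdefault(frozenset(combo), set()).add(tid)
    # Keep only itemsets that meet the absolute support threshold.
    return {
        itemset: tids
        for itemset, tids in supporters.items()
        if len(tids) >= min_support
    }


# Hypothetical toy input: transaction id -> set of items.
transactions = {
    1: {"bread", "milk"},
    2: {"bread", "butter", "milk"},
    3: {"butter", "milk"},
    4: {"bread"},
}

result = frequent_itemsets_with_provenance(transactions, min_support=2)
target = frozenset({"bread", "milk"})
provenance = result[target]

# Sufficiency check: mining only the provenance transactions with the same
# absolute threshold still derives the itemset of interest.
subset = {tid: transactions[tid] for tid in provenance}
rederived = frequent_itemsets_with_provenance(subset, min_support=2)

print(sorted(provenance))   # transaction ids supporting {bread, milk}
print(target in rederived)  # whether the provenance alone is sufficient
```

In this sketch the provenance is collected as a side effect of support counting, which hints at why a specialized tracking approach could be cheaper than re-expressing the mining algorithm as a database query. The sketch covers only sufficiency; necessity and causality would need separate definitions.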




Publication date: 2013